Description

[0001] The present invention relates to a method for developing context-dependent models for automatic speech 
recognition. 

[0002] Small vocabulary speech recognition systems have as their basic units the words in the small vocabulary to
be recognized. For instance, a system for recognizing the English alphabet will typically have twenty-six models, one
model per letter of the alphabet. This approach is impractical for medium and large vocabulary speech recognition
systems. These larger systems typically take as their basic units the phonemes or syllables of a language. If a system
contains one model (for example a Hidden Markov Model) per phoneme of a language, it is called a system with
"context-independent" acoustic models.

[0003] If a system employs different models for a given phoneme, depending on the identity of the surrounding
phonemes, the system is said to employ "context-dependent" acoustic models. An allophone is a specialised version
of a phoneme defined by its context. For instance, all the instances of "ae" pronounced before "t", as in "bat", "fat", etc.,
define an allophone of "ae".

[0004] For most languages, the acoustic realization of a phoneme depends very strongly on the preceding and
following phonemes. For example, an "eh" preceded by a "y" (as in "yes") is quite different from an "eh" preceded by "s"
(as in "set"). Thus, for a system with a medium-sized or large vocabulary, the performance of context-dependent
acoustic models is much better than that of context-independent models. Most practical applications of medium and large
vocabulary recognition systems employ context-dependent acoustic models today.

[0005] Many context-dependent recognition systems today employ decision tree clustering to define the
context-dependent, speaker-independent acoustic models. A tree-growing algorithm finds questions about the phonemes
surrounding the phoneme of interest and splits apart acoustically dissimilar examples of the phoneme of interest. The
result is a decision tree of yes-no questions for selecting the acoustic model that will best recognize a given allophone.
Typically, the yes-no questions pertain to how the allophone appears in context (i.e., what are its neighbouring phonemes).

[0006] A conventional decision tree defines for each phoneme a binary tree containing yes/no questions in the root
node and in each intermediate node (children, grandchildren, etc. of the root node). The terminal nodes, or leaf nodes,
contain the acoustic models designed for particular allophones of the phoneme. Thus, in use, the recognition system
traverses the tree, branching "yes" or "no" based on the context of the phoneme in question until the leaf node containing
the applicable model is identified. Thereafter the identified model is used for recognition.
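The traversal described above can be sketched in a few lines of Python. This is a minimal illustration, not the recognizer of any particular system: the class names, the two example questions, the phoneme sets and the model names are all hypothetical, and a leaf holds only a label where a real system would hold an acoustic model such as an HMM.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    model_name: str  # stands in for an acoustic model (hypothetical label)


@dataclass
class Node:
    # A yes/no question about the context: (previous phoneme, next phoneme) -> bool
    question: Callable[[str, str], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


# Hypothetical phoneme classes used by the example questions.
FRICATIVES = {"f", "h", "s", "th"}
VOICED_CONSONANTS = {"b", "d", "g"}


def select_model(tree, prev_ph, next_ph):
    """Branch yes/no on the context until a leaf is reached, then return its model."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(prev_ph, next_ph) else node.no
    return node.model_name


# A hypothetical tree for the phoneme 'ae'.
AE_TREE = Node(
    question=lambda prev, nxt: nxt in FRICATIVES,
    yes=Leaf("ae_before_fricative"),
    no=Node(
        question=lambda prev, nxt: nxt in VOICED_CONSONANTS,
        yes=Leaf("ae_before_voiced_consonant"),
        no=Leaf("ae_other"),
    ),
)
```

For the word "half" (context h, ae, f), `select_model(AE_TREE, "h", "f")` follows the "yes" branch at the root and stops at the first leaf.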

[0007] Unfortunately, conventional allophone modelling can go wrong. The applicants believe that this is because
current models do not take into account the particular idiosyncrasies of each training speaker. Current methods assume
that individual speaker idiosyncrasies will be averaged out if a large pool of training speakers is used. However, in
practice, it has been found that this assumption does not always hold. Conventional decision tree-based allophone
models work fairly well when a new speaker's speech happens to resemble the speech of the training speaker
population. However, conventional techniques break down when the new speaker's speech lies outside the domain of the
training speaker population.

[0008] The thesis of Robert Westwood entitled "Speaker Adaptation using Eigenvoices" of 31 August 1999, in
connection with work being done through Wolfson College, Department of Engineering, Cambridge University, Cambridge,
United Kingdom, XP002176018, describes a technique for the adaptation of speaker models. Examples are shown and
a conclusion is reached to the effect that for the Eigenvoice decomposition technique disclosed to be effective, it is
important to model the interspeaker variability well.

[0009] According to an aspect of the present invention, there is provided a method of the aforesaid type, comprising
generating an eigenspace to represent a training speaker population; providing a set of acoustic data for at least one
training speaker and representing said acoustic data in said eigenspace to determine at least one allophone centroid
for said training speaker; subtracting said centroid from said acoustic data to generate speaker-adjusted acoustic data
for said training speaker; and using said speaker-adjusted acoustic data to grow at least one decision tree having leaf
nodes containing context-dependent models for different allophones.

[0010] The invention will now be described by way of example only, with reference to the accompanying drawings,
of which:

Figure 1 is a diagrammatic illustration of speaker space useful in understanding how the centroid of a speaker
population and the associated allophone vectors differ from speaker to speaker;

Figure 2 is a block diagram of a first presently preferred embodiment called the eigencentroid plus delta tree
embodiment;

Figure 3 illustrates one embodiment of a speech recognizer that utilises the delta decision tree developed from
the embodiment illustrated in Figure 2;

Figure 4 is another embodiment of a speech recognizer that also uses the delta decision trees generated by the
embodiment of Figure 2;

Figure 5 illustrates how a delta tree might be constructed using the speaker-adjusted data generated by the
embodiment of Figure 2;

Figure 6 shows the grouping of speaker-adjusted data in acoustic space corresponding to the delta tree of Figure 5;

Figure 7 illustrates an exemplary delta decision tree that includes questions about the eigenspace dimensions; and

Figure 8 illustrates a second embodiment of the invention, useful in applications where there is a more complete
quantity of data per speaker.

[0011] The techniques of the invention may be applied to a variety of different speech recognition problems. The
techniques are perhaps most useful in medium and large vocabulary applications, where it is not feasible to represent
each full word by its own model. Two embodiments of the invention will be described here. It will be understood that
the principles of the invention may be extended to other embodiments as well.

[0012] The first embodiment is optimized for applications where each training speaker has supplied a moderate
amount of training data: for example, on the order of twenty to thirty minutes of training data per speaker. With this
quantity of training data it is expected that there will be enough acoustic speech examples to construct reasonably
good context-independent, speaker-dependent models for each speaker. If desired, speaker adaptation techniques
can be used to generate sufficient data for training the context-independent models. Although it is not necessary to
have a full set of examples of all allophones for each speaker, the data should reflect the most important allophones
for each phoneme somewhere in the data (i.e., the allophones have been pronounced a number of times by at least
a small number of speakers).

[0013] The recognition system of this embodiment employs decision trees for identifying the appropriate model for
each allophone, based on the context of that allophone (based on its neighboring phonemes, for example). However,
unlike conventional decision tree-based modeling systems, this embodiment uses speaker-adjusted training data in
the construction of the decision trees. The speaker adjusting process, in effect, removes the particular idiosyncrasies
of each training speaker's speech so that better allophone models can be generated. Then, when the recognition
system is used, a similar adjustment is made to the speech of the new speaker, whereby the speaker-adjusted allophone
models may be accessed to perform high quality, context-dependent recognition.

[0014] An important component of the recognition system of this embodiment is the Eigenvoice technique by which
the training speaker's speech, and the new speaker's speech, may be rapidly analyzed to extract individual speaker
idiosyncrasies. The Eigenvoice technique, discussed more fully below, defines a reduced dimensionality Eigenspace
that collectively represents the training speaker population. When the new speaker speaks during recognition, his or
her speech is rapidly placed or projected into the Eigenspace to very quickly determine how that speaker's speech
"centroid" falls in speaker space relative to the training speakers.

[0015] As will be fully explained, the new speaker's centroid (and also each training speaker's centroid) is defined
by how, on average, each speaker utters the phonemes of the system. For convenience, one can think of the centroid
vector as consisting of the concatenated Gaussian mean vectors for each state of each phoneme HMM in a
context-independent model for a given speaker. However, the concept of "centroid" is scalable and it depends on how much
data is available per training speaker. For instance, if there is enough training data to train a somewhat richer
speaker-dependent model for each speaker (such as a diphone model), then the centroid for each training speaker could be
the concatenated Gaussian means from this speaker-dependent diphone model. Of course, other models, such as
triphone models and the like, may also be implemented.

[0016] Figure 1 illustrates the concept of the centroids by showing diagrammatically how six different training
speakers A-F may pronounce phoneme 'ae' in different contexts. Figure 1 illustrates a speaker space that is diagrammatically
shown for convenience as a two-dimensional space in which each speaker's centroid lies in the two-dimensional space
at the center of the allophone vectors collected for that speaker. Thus, in Figure 1, the centroid of speaker A lies at the
origin of the respective allophone vectors derived as speaker A uttered the following words: "mass", "lack", and "had".
Thus the centroid for speaker A contains information that in rough terms represents the "average" phoneme 'ae' for
that speaker.

[0017] By comparison, the centroid of speaker B lies to the right of speaker A in speaker space. Speaker B's centroid
has been generated by the following utterances: "laugh", "rap", and "bag". As illustrated, the other speakers C-F lie in
other regions within the speaker space. Note that each speaker has a set of allophones that are represented as vectors
emanating from the centroid (three allophone vectors are illustrated in Figure 1). As illustrated, these vectors define
angular relationships that are often roughly comparable between different speakers. Compare angle 10 of speaker A
with angle 12 of speaker B. However, because the centroids of the respective speakers do not lie coincident with one
another, the resulting allophones of speakers A and B are not the same. The present invention is designed to handle
this problem by removing the speaker-dependent idiosyncrasies characterized by different centroid locations.

[0018] While the angular relationships among allophone vectors are generally comparable among speakers, that is
not to say that the vectors are identical. Indeed, vector lengths may vary from one speaker to another. Male speakers
and female speakers would likely have different allophone vector lengths from one another. Moreover, there can be
different angular relationships attributable to different speaker dialects. In this regard, compare angle 14 of speaker E
with angle 10 of speaker A. This angular difference would be reflective, for example, where speaker A speaks a northern
United States dialect whereas speaker E speaks a southern United States dialect.

[0019] These vector lengths and angular differences aside, the disparity in centroid locations represents a significant
speaker-dependent artifact that conventional context-dependent recognizers fail to address. As will be more fully
explained below, the present invention provides a mechanism to readily compensate for the disparity in centroid locations
and also to compensate for other vector length and angular differences.

[0020] Figure 2 illustrates a presently preferred first embodiment that we call the Eigen centroid plus delta tree
embodiment. More specifically, Figure 2 shows the preferred steps for training the delta trees that are then used by the
recognizer. Figures 3 and 4 then show alternate embodiments for use of that recognizer with speech supplied by a
new speaker.

[0021] Referring to Figure 2, the delta decision trees used by this embodiment may be grown by providing acoustic
data from a plurality of training speakers, as illustrated at 16. The acoustic data from each training speaker is projected
or placed into an eigenspace 18. In the presently preferred embodiment the eigenspace can be truncated to reduce
its size and computational complexity. We refer here to the reduced size eigenspace as K-space.

[0022] One procedure for creating eigenspace 18 is illustrated by steps 20-26. The procedure uses the training
speaker acoustic data 16 to generate speaker-dependent (SD) models for each training speaker, as depicted at step
20. These models are then vectorized at step 22. In the presently preferred embodiment, the speaker-dependent
models are vectorized by concatenating the parameters of the speech models of each speaker. Typically Hidden Markov
Models are used, resulting in a supervector for each speaker that may comprise an ordered list of parameters (typically
floating point numbers) corresponding to at least a portion of the parameters of the Hidden Markov Models for that
speaker. The parameters may be organized in any convenient order. The order is not critical; however, once an order
is adopted it must be followed for all training speakers.
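As a rough sketch of the vectorizing step, the following Python function concatenates Gaussian mean vectors in a fixed phoneme order. The data layout (a dictionary mapping each phoneme to a list of per-state mean vectors) and all names are assumptions made for illustration; an actual system may include other HMM parameters as well.

```python
def build_supervector(speaker_model, phoneme_order):
    """Concatenate per-state Gaussian means into one flat supervector.

    speaker_model: dict mapping phoneme -> list of per-state mean vectors
                   (a hypothetical layout chosen for this sketch).
    phoneme_order: the fixed ordering that must be shared by all speakers.
    """
    supervector = []
    for phoneme in phoneme_order:
        for state_mean in speaker_model[phoneme]:
            supervector.extend(state_mean)
    return supervector


PHONEME_ORDER = ["ae", "t"]  # any order works, as long as it never changes

speaker_a = {"ae": [[1.0, 2.0], [3.0, 4.0]], "t": [[5.0, 6.0]]}
# Same structure, stored in a different key order: the fixed ordering still aligns it.
speaker_b = {"t": [[50.0, 60.0]], "ae": [[10.0, 20.0], [30.0, 40.0]]}

sv_a = build_supervector(speaker_a, PHONEME_ORDER)
sv_b = build_supervector(speaker_b, PHONEME_ORDER)
```

Because both supervectors are built with the same `PHONEME_ORDER`, position i refers to the same model parameter for every speaker.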

[0023] Next, a dimensionality reduction step is performed on the supervectors at step 24 to define the eigenspace.
Dimensionality reduction can be effected through any linear transformation that reduces the original high-dimensional
supervectors into basis vectors. A non-exhaustive list of dimensionality reduction techniques includes: Principal
Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), Factor Analysis
(FA) and Singular Value Decomposition (SVD).

[0024] The basis vectors generated at step 24 define an eigenspace spanned by the eigenvectors. Dimensionality
reduction yields one eigenvector for each one of the training speakers. Thus if there are n training speakers, the
dimensionality reduction step 24 produces n eigenvectors. These eigenvectors define what we call eigenvoice space or
eigenspace.

[0025] The eigenvectors that make up the eigenspace each represent a different dimension across which different
speakers may be differentiated. Each supervector in the original training set can be represented as a linear combination
of these eigenvectors. The eigenvectors are ordered by their importance in modeling the data: the first eigenvector is
more important than the second, which is more important than the third, and so on.

[0026] Although a maximum of n eigenvectors is produced at step 24, in practice, it is possible to discard several of
these eigenvectors, keeping only the first K eigenvectors. Thus at step 26 we optionally extract K of the n eigenvectors
to comprise a reduced parameter eigenspace or K-space. The higher order eigenvectors can be discarded because
they typically contain less important information with which to discriminate among speakers. Reducing the eigenvoice
space to fewer than the total number of training speakers helps to eliminate noise found in the original training data,
and also provides an inherent data compression that can be helpful when constructing practical systems with limited
memory and processor resources.
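One way to picture steps 24 and 26 is PCA computed through a singular value decomposition, one of the techniques listed above. The sketch below uses NumPy and makes two assumptions of its own: the supervectors are centered on their mean before decomposition (so at most n-1 directions carry variance), and the function names are illustrative only.

```python
import numpy as np


def build_eigenspace(supervectors, k):
    """Return (mean, basis, singular_values); basis holds the first k eigenvectors as rows."""
    X = np.asarray(supervectors, dtype=float)  # shape (n_speakers, dim)
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k], singular_values


def project(supervector, mean, basis):
    """Coordinates of a supervector in K-space."""
    return basis @ (np.asarray(supervector, dtype=float) - mean)


def reconstruct(coords, mean, basis):
    """Supervector corresponding to a point in K-space."""
    return mean + basis.T @ coords


# Four toy speakers whose supervectors vary along only two directions.
supervectors = np.array([
    [6.0, 5.0, 1.0, 1.0, 1.0, 1.0],
    [5.0, 6.0, 1.0, 1.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 1.0, 1.0, 1.0],
    [5.0, 4.0, 1.0, 1.0, 1.0, 1.0],
])
mean, basis, singular_values = build_eigenspace(supervectors, k=2)
coords = project(supervectors[0], mean, basis)
rebuilt = reconstruct(coords, mean, basis)
```

In this toy case two eigenvectors capture all of the variation, so `rebuilt` matches the first supervector; with real data the discarded directions would carry a residual.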

[0027] Having constructed the eigenspace 18, the acoustic data of each individual training speaker is projected or
placed in eigenspace as at 28. The location of each speaker's data in eigenspace (K-space) represents each speaker's
centroid or average phoneme pronunciation. As illustrated in Figure 1, these centroids may be expected to differ from
speaker to speaker. Speed is one significant advantage of using the eigenspace technique to determine speaker
phoneme centroids.

[0028] The presently preferred technique for placing each speaker's data within eigenspace involves a technique
that we call the Maximum Likelihood Estimation Technique (MLED). In practical effect, the Maximum Likelihood
Technique will select the supervector within eigenspace that is most consistent with the speaker's input speech, regardless
of how much speech is actually available.

[0029] To illustrate, assume that the speaker is a young female native of Alabama. Upon receipt of a few uttered
syllables from this speaker, the Maximum Likelihood Technique will select a point within eigenspace that represents
all phonemes (even those not yet represented in the input speech) consistent with this speaker's native Alabama female
accent.

[0030] The Maximum Likelihood Technique employs a probability function Q that represents the probability of
generating the observed data for a predefined set of Hidden Markov Models. Manipulation of the probability function Q is
made easier if the function includes not only a probability term P but also the logarithm of that term, log P. The probability
function is then maximized by taking the derivative of the probability function individually with respect to each of the
eigenvalues. For example, if the eigenspace is of dimension 100, this system calculates 100 derivatives of the
probability function Q, setting each to zero and solving for the respective eigenvalue W.

[0031] The resulting set of Ws, so obtained, represents the eigenvalues needed to identify the point in eigenspace
that corresponds to the point of maximum likelihood. Thus the set of Ws comprises a maximum likelihood vector in
eigenspace. This maximum likelihood vector may then be used to construct a supervector that corresponds to the
optimal point in eigenspace.

[0032] In the context of the maximum likelihood framework of the invention, we wish to maximize the likelihood of
an observation O with regard to a given model. This may be done iteratively by maximizing the auxiliary function Q
presented below.



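$$Q(\lambda, \hat{\lambda}) = \sum_{\theta \,\in\, \text{states}} P(O, \theta \mid \lambda) \, \log\left[ P(O, \theta \mid \hat{\lambda}) \right]$$

where $\lambda$ is the model and $\hat{\lambda}$ is the estimated model.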

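[0033] As a preliminary approximation, we might want to carry out a maximization with regards to the means only.
In the context where the probability P is given by a set of HMMs, we obtain the following:

$$Q(\lambda, \hat{\lambda}) = \text{const} - \frac{1}{2} P(O \mid \lambda) \sum_{s \,\in\, \text{states of } \lambda}^{S_\lambda} \; \sum_{m \,\in\, \text{mixture Gaussians of } s}^{M_s} \; \sum_{t}^{T} \left\{ \gamma_m^{(s)}(t) \left[ n \log(2\pi) + \log\left| C_m^{(s)} \right| + h(o_t, m, s) \right] \right\}$$

where:

$$h(o_t, m, s) = \left( o_t - \hat{\mu}_m^{(s)} \right)^{T} C_m^{(s)-1} \left( o_t - \hat{\mu}_m^{(s)} \right)$$

and let

$o_t$ be the feature vector at time t

$C_m^{(s)-1}$ be the inverse covariance for mixture Gaussian m of state s

$\hat{\mu}_m^{(s)}$ be the approximated adapted mean for state s, mixture component m

$\gamma_m^{(s)}(t)$ be the P(using mix Gaussian m | $\lambda$, $o_t$)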

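[0034] Suppose the Gaussian means for the HMMs of the new speaker are located in eigenspace. Let this space
be spanned by the mean supervectors $\bar{\mu}_j$ with j = 1...E,

$$\bar{\mu}_j = \begin{bmatrix} \bar{\mu}_1^{(1)}(j) \\ \bar{\mu}_2^{(1)}(j) \\ \vdots \\ \bar{\mu}_m^{(s)}(j) \\ \vdots \\ \bar{\mu}_{M_{S_\lambda}}^{(S_\lambda)}(j) \end{bmatrix}$$

where $\bar{\mu}_m^{(s)}(j)$ represents the mean vector for the mixture Gaussian m in the state s of the eigenvector (eigenmodel) j.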



[0035] Then we need:

$$\hat{\mu} = \sum_{j=1}^{E} w_j \bar{\mu}_j$$

[0036] The $\bar{\mu}_j$ are orthogonal and the $w_j$ are the eigenvalues of our speaker model. We assume here that any new
speaker can be modeled as a linear combination of our database of observed speakers. Then

$$\hat{\mu}_m^{(s)} = \sum_{j=1}^{E} w_j \bar{\mu}_m^{(s)}(j)$$

with s in states of $\lambda$, m in mixture Gaussians of M.

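[0037] Since we need to maximize Q, we just need to set

$$\frac{\partial Q}{\partial w_e} = 0, \quad e = 1 \ldots E.$$

(Note that because the eigenvectors are orthogonal, $\frac{\partial w_i}{\partial w_j} = 0$, $i \neq j$.)
Hence we have

$$\frac{\partial Q}{\partial w_e} = 0 = \sum_{s \,\in\, \text{states of } \lambda}^{S_\lambda} \; \sum_{m \,\in\, \text{mixture Gaussians of } s}^{M_s} \; \sum_{t}^{T} \left\{ \frac{\partial}{\partial w_e} \, \gamma_m^{(s)}(t) \, h(o_t, m, s) \right\}, \quad e = 1 \ldots E.$$

Computing the above derivative, we have:

$$0 = \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t) \left\{ -\bar{\mu}_m^{(s)T}(e) \, C_m^{(s)-1} \, o_t + \sum_{j=1}^{E} w_j \, \bar{\mu}_m^{(s)T}(j) \, C_m^{(s)-1} \, \bar{\mu}_m^{(s)}(e) \right\}$$

from which we find the set of linear equations

$$\sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t) \, \bar{\mu}_m^{(s)T}(e) \, C_m^{(s)-1} \, o_t = \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t) \sum_{j=1}^{E} w_j \, \bar{\mu}_m^{(s)T}(j) \, C_m^{(s)-1} \, \bar{\mu}_m^{(s)}(e), \quad e = 1 \ldots E.$$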



[0038] Once the centroids for each speaker have been determined, they are subtracted at step 30 to yield
speaker-adjusted acoustic data. Referring to Figure 1, this centroid subtraction process will tend to move all speakers within
speaker space so that their centroids are coincident with one another. This, in effect, removes the speaker
idiosyncrasies, leaving only the allophone-relevant data.

[0039] After all training speakers have been processed in this fashion, the resulting speaker-adjusted training data
is used at step 32 to grow delta decision trees as illustrated diagrammatically at 34. A decision tree is grown in this
fashion for each phoneme. The phoneme 'ae' is illustrated at 34. Each tree comprises a root node 36 containing a
question about the context of the phoneme (i.e., a question about the phoneme's neighbors or other contextual
information). The root node question may be answered either "yes" or "no", thereby branching left or right to a pair of child
nodes. The child nodes can contain additional questions, as illustrated at 38, or a speech model, as illustrated at 40.
Note that all leaf nodes (nodes 40, 42, and 44) contain speech models. These models are selected as being the models
most suited for recognizing a particular allophone. Thus the speech models at the leaf nodes are context-dependent.
[0040] After the delta decision trees have been developed, as illustrated in Figure 2, the system may be used to
recognize the speech of a new speaker. Two recognizer embodiments will now be described with reference to Figures
3 and 4. The recognizer embodiments differ essentially in whether the new speaker centroid is subtracted from the
acoustic data prior to context-dependent recognition (Fig. 3), or whether the centroid information is added to the
context-dependent models prior to context-dependent recognition (Fig. 4).

[0041] Referring to Figure 3, the new speaker 50 supplies an utterance that is routed to several processing blocks,
as illustrated. The utterance is supplied to a speaker-independent recognizer 52 that functions simply to initiate the
MLED process.

[0042] Before the new speaker's utterance is submitted to the context-dependent recognizer 60, the new speaker's
centroid information is subtracted from the speaker's acoustic data. This is accomplished by calculating the position
of the new speaker within the eigenspace (K-space) as at 62 to thereby determine the centroid of the new speaker as
at 64. Preferably the previously described MLED technique is used to calculate the position of the new speaker in
K-space.

[0043] Having determined the centroid of the new speaker, the centroid data is subtracted from the new speaker's
acoustic data as at 66. This yields speaker-adjusted acoustic data 68 that is then submitted to the context-dependent
recognizer 60.

[0044] The alternate embodiment illustrated at Figure 4 works in a somewhat similar fashion. The new speaker's
utterance is submitted to the speaker-independent recognizer 52 as before, to initiate the MLED process. Of course,
if the MLED process is not being used in a particular embodiment, the speaker-independent recognizer may not be
needed.

[0045] Meanwhile, the new speaker's utterance is placed into eigenspace as at step 62 to determine the centroid of
the new speaker as at 64. The centroid information is then added to the context-dependent models as at 72 to yield a
set of speaker-adjusted context-dependent models 74. These speaker-adjusted models are then used by the
context-dependent recognizer 60 in producing the recognizer output 70. Table 1 below shows how exemplary data items for
three speakers may be speaker-adjusted by subtracting out the centroid. All data items in the table are pronunciations
of the phoneme 'ae' (in a variety of contexts).
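The two recognizer arrangements can be compared with a small numerical sketch. It assumes, purely for illustration, that a delta model is a single diagonal-covariance Gaussian and that adding the centroid to a model means shifting its mean; all values and names are made up. Under that assumption, subtracting the centroid from the frame (Figure 3) and adding it to the model mean (Figure 4) give the same score.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


centroid = [7.0, 7.0]    # new speaker's centroid (hypothetical values)
frame = [8.5, 6.0]       # one incoming acoustic frame
delta_mean = [1.0, 1.0]  # mean of a speaker-adjusted (delta) model
variance = [0.5, 2.0]

# Figure 3 arrangement: subtract the centroid from the data, keep the delta model.
adjusted_frame = [f - c for f, c in zip(frame, centroid)]
score_fig3 = log_gaussian(adjusted_frame, delta_mean, variance)

# Figure 4 arrangement: add the centroid to the model, keep the raw data.
adjusted_mean = [m + c for m, c in zip(delta_mean, centroid)]
score_fig4 = log_gaussian(frame, adjusted_mean, variance)
```

The choice between the two is then mainly a question of where the work is done: once per incoming frame, or once per model when the speaker's centroid becomes known.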

[0046] Figure 5 then shows how a delta tree might be constructed using this speaker-adjusted data. Figure 6 then
shows the grouping of the speaker-adjusted data in acoustic space. In Figure 6, +1 means next phoneme; the fricatives
are the set of phonemes {f, h, s, th, ...}; voiced consonants are {b, d, g, ...}.



TABLE 1

Spkr1: centroid = (2,3)
    "half"    =>   <h *ae f>      (3,4)    - (2,3)    =  (1,1)
    "sad"     =>   <s *ae d>      (2,2)    - (2,3)    =  (0,-1)
    "fat"     =>   <f *ae t>      (1.5,3)  - (2,3)    =  (-0.5,0)

Spkr2: centroid = (7,7)
    "math"    =>   <m *ae th>     (8,8)    - (7,7)    =  (1,1)
    "babble"  =>   <b *ae b l>    (7,6)    - (7,7)    =  (0,-1)
    "gap"     =>   <g *ae p>      (6.5,7)  - (7,7)    =  (-0.5,0)

Spkr3: centroid = (10,2)
    "task"    =>   <t *ae s k>    (11,3)   - (10,2)   =  (1,1)
    "cad"     =>   <k *ae d>      (10,1)   - (10,2)   =  (0,-1)
    "tap"     =>   <t *ae p>      (9.5,2)  - (10,2)   =  (-0.5,0)
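The arithmetic of Table 1 can be reproduced with a short Python sketch. The dictionary layout and the function name are illustrative choices, and the word for the first Spkr3 item is read as "task"; the numbers are those of the table.

```python
from collections import defaultdict

TABLE_1 = {
    "Spkr1": {"centroid": (2.0, 3.0),
              "items": {"half": (3.0, 4.0), "sad": (2.0, 2.0), "fat": (1.5, 3.0)}},
    "Spkr2": {"centroid": (7.0, 7.0),
              "items": {"math": (8.0, 8.0), "babble": (7.0, 6.0), "gap": (6.5, 7.0)}},
    "Spkr3": {"centroid": (10.0, 2.0),
              "items": {"task": (11.0, 3.0), "cad": (10.0, 1.0), "tap": (9.5, 2.0)}},
}


def speaker_adjust(point, centroid):
    """Subtract the speaker's centroid from one data item."""
    return tuple(p - c for p, c in zip(point, centroid))


adjusted = {
    speaker: {word: speaker_adjust(point, entry["centroid"])
              for word, point in entry["items"].items()}
    for speaker, entry in TABLE_1.items()
}

# Pool the adjusted items across speakers: items with the same delta fall together.
pooled = defaultdict(list)
for words in adjusted.values():
    for word, delta in words.items():
        pooled[delta].append(word)
```

Before adjustment the nine items are scattered by speaker; afterwards they collapse onto three deltas, each shared by one word from every speaker, which is the grouping a delta tree can then separate by context.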



[0047] If desired, standard deviations as well as means may be used in the speaker-adjustment process. This would
be done, for example, by imposing a unit variance condition (as in cepstral normalization). After speaker-dependent
centroid training, the supervectors submitted to MLED would contain standard deviations as well as means. For each
training data item, after subtraction of the phoneme state centroid from each data item, the item would be further
adjusted by dividing by centroid standard deviations. This would result in even more accurate pooling of allophone
data by the trees. There would be some computational costs at run time when using this technique, because the
speaker-adjustment of incoming frames would be slightly more complex.
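A minimal sketch of this extended adjustment follows, assuming per-dimension centroid means and standard deviations are already available; the function name and the numbers are hypothetical.

```python
def speaker_adjust_with_std(frame, centroid_mean, centroid_std):
    """Subtract the centroid mean, then divide by the centroid standard deviation."""
    return [(x - m) / s for x, m, s in zip(frame, centroid_mean, centroid_std)]


# One frame from a speaker whose centroid is (10, 2) with standard deviations (2, 0.5).
normalized = speaker_adjust_with_std([11.0, 3.0], [10.0, 2.0], [2.0, 0.5])
```

The extra division per dimension is the additional run-time cost mentioned above.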

[0048] As previously noted, co-articulation can be affected by speaker type in a way that causes the direction of the
allophone vectors to differ. This was illustrated in Figure 1 wherein the angular relationships of offset vectors differed
depending on whether the speaker was from the north or from the south. This phenomenon may be taken into account
by including decision tree questions about the eigen dimensions. Figure 7 shows an exemplary delta decision tree that
includes questions about the eigen dimensions in determining which model to apply to a particular allophone. In Figure
7, questions 80 and 82 are eigen dimension questions. The questions ask whether a particular eigen dimension (in
this case dimension 3) is greater than zero. Of course, other questions can also be asked about the eigen dimension.
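A tree that mixes context questions with eigen dimension questions might be sketched as follows. The tuple-based tree layout, the model labels and the context dictionary are assumptions for illustration; the only detail taken from the text is the form of the eigen question, which asks whether dimension 3 is greater than zero.

```python
# Each internal node is (question, yes_branch, no_branch); each leaf is a model label.
DELTA_TREE = (
    lambda ctx: ctx["next"] in {"f", "h", "s", "th"},  # context question: next phoneme a fricative?
    (
        # Eigen dimension question: dimension 3 is index 2 when counting from zero.
        lambda ctx: ctx["eigen_coords"][2] > 0.0,
        "model_A",
        "model_B",
    ),
    "model_C",
)


def pick_model(tree, ctx):
    """Walk the tree, answering each question from the context and eigen coordinates."""
    while not isinstance(tree, str):
        question, yes_branch, no_branch = tree
        tree = yes_branch if question(ctx) else no_branch
    return tree


choice = pick_model(DELTA_TREE, {"next": "s", "eigen_coords": [0.1, -0.4, 0.7]})
```

Two speakers uttering the same allophone can thus be routed to different leaf models when their positions in eigenspace differ.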
[0049] Another embodiment of the invention will now be described in connection with Figure 8. This embodiment is
suited for applications in which there is a sufficient quantity of data per speaker to train reasonably accurate
speaker-dependent models. In this embodiment it is not necessary to ascertain the centroids of each speaker.

[0050] However, in order to employ the eigenvoice technique, it is necessary to have a set of supervectors (one from
each training speaker-dependent model). These supervectors must have the same dimension and must be aligned in
the same sense that the index i must represent the same parameter across all speaker-dependent models.

[0051] Therefore, to grow a good context-dependent allophone tree for a given phoneme that is sharable across
speakers, this embodiment pools data across speakers, but keeps track of which data item came from which speaker.
The maximum likelihood estimation (MLE) criterion for choosing a question is thus extended to accumulate an overall
score for each test question, while separately evaluating and retaining the scores for individual speakers. Figure 8
illustrates the technique.

[0052] Referring to Figure 8, the decision tree structure is grown by providing a pool of questions 100. These
questions are individually tested by the tree-growing algorithm, to determine which questions best define the structure of
the allophone trees.

[0053] The pool of questions is examined, one question at a time, through an iterative technique. Thus the system
of Figure 8 includes iterator 102 for selecting a question from pool 100 so that it may be tested. The current question
under test is illustrated at 104.

[0054] Recall that each test question may relate in some way to the context in which a particular phoneme occurs.
Thus the test question might be, for example, whether the given phoneme is preceded by a fricative. The tree-growing
algorithm grows individual trees for each phoneme, starting with a root node question and proceeding to additional
nodes, as needed, until the allophones of that phoneme are well-represented by the tree structure. Selection of the root
node question and any intermediate node questions proceeds as illustrated in Figure 8.

[0055] The procedure for selecting test questions works by assuming that the current question under evaluation
(question 104) has been chosen for that node of the tree. Speaker data from the training speakers 106 are evaluated
by the test question 104 to thereby split the speech data into two portions: a portion that answered "yes" to the test
question and a portion that answered "no" to the test question. Speech models are then constructed using the test
speaker data. Specifically, a "yes" model 106 and a "no" model 108 are constructed for each speaker. This is different
from the conventional procedure in which data for all speakers is pooled and, for a given question, one "yes" and one
"no" model are trained from the pooled data. The models are trained by training acoustic features on all the speech
data examples that answer "yes" to the test question and similarly training another set of acoustic features on the data
that answers "no" to the test question.

[0056] After having generated a "yes" model 106 and a "no" model 108 for each speaker, the system calculates the
probability score of all the "yes" data given the "yes" model 106 and also calculates the probability score of all the "no"
data given the "no" model 108. A high probability score means that the constructed model is doing a good job of
recognizing its portion of the training data. A low probability score means that the model, while perhaps the best model
that could be created using the training data, is not doing a good job of recognizing the phoneme in question.

[0057] The probability scores are assessed to compute the overall score for the test question 104. The computation
proceeds as illustrated in Figure 8 as follows. First the respective probability scores for the "yes" model and the "no"
model are computed for a first training speaker (speaker A). These scores are multiplied together to give a cumulative
product score indicative of how well the models worked with speaker A. This is illustrated at 112. The same procedure
is then followed for the remaining training speakers, one speaker at a time, as illustrated at 114 and 116. Finally, when
all of the training speakers have been taken into account, an overall score is computed by multiplying the resultant
products derived from individual speakers. Thus the products ascertained at steps 112, 114 and 116 are multiplied
together to yield an overall score for the test question at 118.
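The per-speaker scoring of a test question can be sketched as below. Several simplifications are assumptions of this sketch rather than features of the described system: each data item is a single number, each "yes" or "no" model is a one-dimensional Gaussian fitted by maximum likelihood with a variance floor, and scores are kept as log probabilities, so the products described above become sums. The question names and toy data are hypothetical.

```python
import math


def gaussian_loglik(values):
    """Log-likelihood of values under the ML Gaussian fitted to those same values."""
    if not values:
        return 0.0
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-3)  # variance floor (assumed)
    return sum(-0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var) for v in values)


def question_score(question, data_by_speaker):
    """Sum over speakers of the log scores of that speaker's own 'yes' and 'no' models."""
    total = 0.0
    for items in data_by_speaker.values():
        yes_data = [x for ctx, x in items if question(ctx)]
        no_data = [x for ctx, x in items if not question(ctx)]
        total += gaussian_loglik(yes_data) + gaussian_loglik(no_data)
    return total


def best_question(questions, data_by_speaker):
    """Name of the question with the highest overall score."""
    return max(questions, key=lambda name: question_score(questions[name], data_by_speaker))


# Toy data: (next phoneme, feature). Speaker B is offset by 10 relative to speaker A.
DATA = {
    "A": [("s", 1.0), ("f", 1.1), ("d", -1.0), ("t", -0.9)],
    "B": [("s", 11.0), ("th", 10.9), ("b", 9.0), ("p", 9.1)],
}
QUESTIONS = {
    "next_is_fricative": lambda nxt: nxt in {"f", "h", "s", "th"},
    "next_is_voiced_consonant": lambda nxt: nxt in {"b", "d", "g"},
}
winner = best_question(QUESTIONS, DATA)
```

Because the "yes" and "no" models are fitted separately for each speaker, the large offset between speakers A and B does not blur the split, and the question that separates the data cleanly within each speaker receives the best overall score.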

[0058] Having generated an overall score for the first test question, iterator 102 stores the overall score results and
then draws a second question from the pool of questions 100 for testing in the same fashion. When all questions in
the pool have been tested, the question that gave the best overall score is selected for that node of the decision tree.
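
The iteration over the question pool might look like the following sketch. It assumes scalar acoustic features, one Gaussian per "yes" or "no" model, and log-domain scores; the question names, the `VOICED` set and the toy data are hypothetical. Note that the two speakers have different overall feature levels, yet each speaker's models are trained only on that speaker's data.

```python
import numpy as np

def log_score(values, var_floor=1e-3):
    """Log-probability of scalar `values` under a Gaussian fit to them."""
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return 0.0  # an empty branch contributes nothing
    var = max(values.var(), var_floor)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (values - values.mean()) ** 2 / var)))

def score_question(question, speakers):
    """Train per-speaker "yes"/"no" models and accumulate their log-scores."""
    total = 0.0
    for examples in speakers.values():
        yes = [x for ctx, x in examples if question(ctx)]
        no = [x for ctx, x in examples if not question(ctx)]
        total += log_score(yes) + log_score(no)
    return total

def best_question(pool, speakers):
    """Return the name of the question with the highest overall score."""
    return max(pool, key=lambda name: score_question(pool[name], speakers))

VOICED = {"b", "d", "g"}
pool = {  # context is (preceding phoneme, following phoneme)
    "is +1 voiced?": lambda ctx: ctx[1] in VOICED,
    "is -1 's'?": lambda ctx: ctx[0] == "s",
}
speakers = {
    "A": [(("s", "d"), 1.0), (("k", "b"), 1.1), (("s", "t"), 3.0), (("k", "p"), 3.1)],
    "B": [(("s", "g"), 5.0), (("h", "d"), 5.1), (("s", "p"), 7.0), (("h", "t"), 7.1)],
}
chosen = best_question(pool, speakers)
print(chosen)
```
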
[0059] After the root node of the decision tree has been determined as described above, the iterator 102 may continue
to assess whether further intermediate nodes produce sufficient improvements in allophone recognition to warrant
adding additional nodes to the tree. Ultimately, when the tree is grown in this fashion, the leaf nodes contain the models
that best "recognize" the allophones of a particular phoneme.
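
One possible reading of "sufficient improvements" is a minimum gain in log-probability score before a node is split. The sketch below grows a tree recursively under that assumption. The threshold `min_gain`, the gain criterion, the nested-dictionary tree format and the toy data are all hypothetical, and the per-speaker bookkeeping is omitted for brevity.

```python
import numpy as np

def log_score(values, var_floor=1e-3):
    """Log-probability of scalar `values` under a Gaussian fit to them."""
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return 0.0
    var = max(values.var(), var_floor)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (values - values.mean()) ** 2 / var)))

def grow(examples, pool, min_gain):
    """Recursively split (context, value) examples; return a nested dict tree."""
    base = log_score([x for _, x in examples])
    best = None
    for name, question in pool.items():
        yes = [e for e in examples if question(e[0])]
        no = [e for e in examples if not question(e[0])]
        if not yes or not no:
            continue  # question does not separate this node's data
        gain = log_score([x for _, x in yes]) + log_score([x for _, x in no]) - base
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    if best is None or best[0] < min_gain:
        return {"leaf": [ctx for ctx, _ in examples]}  # improvement too small
    _, name, yes, no = best
    return {"question": name,
            "yes": grow(yes, pool, min_gain),
            "no": grow(no, pool, min_gain)}

VOICED = {"b", "d", "g"}
pool = {  # context is (preceding phoneme, following phoneme)
    "is +1 voiced?": lambda ctx: ctx[1] in VOICED,
    "is -1 's'?": lambda ctx: ctx[0] == "s",
}
examples = [(("s", "d"), 1.0), (("k", "b"), 1.1), (("s", "t"), 3.0), (("k", "p"), 3.1)]
tree = grow(examples, pool, min_gain=5.0)
print(tree["question"])
```

With this toy data the root is split on the voicing question, and both children become leaves because splitting them further would gain less than the threshold.
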

[0060] After the decision tree structures have been identified through the foregoing procedures, the eigenvoice
technique may now be applied. If a single Gaussian per leaf node is sufficient to represent the allophone, the allophonic
speaker-dependent models are trained using the shared tree structure to obtain a set of supervectors which are then
used to construct the eigenspace through dimensionality reduction. With training now complete, the online step is a
simple MLED estimation of the eigenvoice coefficients. Multiple Gaussian models are slightly more complicated
because the question of alignment must be addressed. That is, whereas it is known that leaf node N of speaker-dependent
model 1 and leaf node N of speaker-dependent model 2 represent the same allophone, it is not certain that Gaussian
i of leaf N in speaker-dependent model 1 represents the same phenomenon as Gaussian i of leaf N in speaker-dependent
model 2. One way to address this issue is to find a centroid for each leaf of the speaker-dependent model
and then speaker-adjust the data reaching all leaves. One would then pool data for a given leaf across speaker-dependent
models and calculate shared delta Gaussians. At run time, MLED would yield estimates of all leaf centroids,
which could then be subtracted from the new speaker's data before it is evaluated against the delta Gaussians.
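
The eigenspace construction and the run-time centroid subtraction can be sketched as follows. Several assumptions are made for illustration: a supervector is taken to be the concatenation of one speaker's leaf centroids, principal component analysis (via SVD) is used as the dimensionality reduction, and a plain least-squares projection stands in for MLED. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def build_eigenspace(supervectors, k):
    """PCA over training-speaker supervectors; returns (mean, basis[k, D])."""
    X = np.asarray(supervectors, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal principal directions ("eigenvoices").
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def estimate_centroids(observed, mean, basis):
    """Project a new speaker's supervector estimate into the eigenspace.

    Least-squares projection, used here only as a stand-in for MLED.
    """
    coeffs = basis @ (observed - mean)
    return mean + coeffs @ basis

def speaker_adjust(frames, leaf_ids, centroids):
    """Subtract each frame's leaf centroid before scoring delta Gaussians."""
    return frames - centroids[leaf_ids]

rng = np.random.default_rng(0)
n_leaves, dim, k = 3, 2, 2
factors = rng.normal(size=(k, n_leaves * dim))       # hypothetical speaker factors
offset = rng.normal(size=n_leaves * dim)
train = offset + rng.normal(size=(6, k)) @ factors   # one supervector per speaker
mean, basis = build_eigenspace(train, k)

new_speaker = offset + np.array([0.5, -1.0]) @ factors  # unseen speaker
centroids = estimate_centroids(new_speaker, mean, basis).reshape(n_leaves, dim)

leaf_ids = np.array([0, 2, 2, 1])                    # leaf reached by each frame
frames = centroids[leaf_ids] + rng.normal(scale=0.1, size=(4, dim))
adjusted = speaker_adjust(frames, leaf_ids, centroids)
print(adjusted.shape)
```

Because the synthetic speakers all lie in a two-dimensional subspace, the two-dimensional eigenspace reconstructs the new speaker's centroids exactly; with real speech the reconstruction would only be approximate.
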

Claims 

1. A method for developing context-dependent models for automatic speech recognition, comprising:

generating an eigenspace (18) to represent a training speaker population;

providing a set of acoustic data (16) for at least one training speaker and representing said acoustic data in
said eigenspace (28) to determine at least one allophone centroid for said training speaker; and
subtracting said centroid from said acoustic data (30) to generate speaker-adjusted acoustic data for said
training speaker;

using said speaker-adjusted acoustic data to grow at least one decision tree (32) having leaf nodes containing
context-dependent models for different allophones.

2. The method of claim 1 further comprising using a set of acoustic data for a plurality of training speakers to generate
said speaker-adjusted acoustic data for each of said plurality of training speakers.

3. The method of claim 1 wherein said eigenspace is generated by constructing supervectors (22) based on speech
from said training speaker population and performing dimensionality reduction (24) upon said supervectors to
define a reduced dimensionality space that spans said training speaker population.

4. A method of performing speech recognition using said context-dependent models developed as recited in claim
1, comprising:

providing speech data from a new speaker (50);

using said eigenspace (62) to determine at least one new speaker centroid (64) of a new speaker and subtracting
said new speaker centroid from said speech data (66) from said new speaker to generate speaker-adjusted
data (68); and

applying said speaker-adjusted data to a speech recognizer (60) employing said context-dependent models
(58).

5. A method of performing speech recognition using said context-dependent models developed as recited in claim
1, comprising:

providing speech data from a new speaker (50);

using said eigenspace (62) to determine at least one new speaker centroid (64) of a new speaker and adding
said new speaker centroid to said context-dependent models (72) to generate new speaker-adjusted
context-dependent models (74); and

applying said speech data to a speech recognizer (60) employing said new speaker-adjusted context-dependent
models (74).



Claims (translated from the German version, Patentansprüche)

1. A method for developing context-dependent models for automatic speech recognition, comprising:

generating an eigenspace (18) to represent a training speaker population;

providing a set of acoustic data (16) for at least one training speaker and representing the acoustic data in
the eigenspace (28) in order to determine at least one allophone centroid for the training speaker; and

subtracting the centroid from the acoustic data (30) to generate speaker-adjusted acoustic data
for the training speaker;

using the speaker-adjusted acoustic data to grow at least one decision tree (32) having
leaf nodes that have context-dependent models for different allophones.

2. The method according to claim 1, further comprising using a set of acoustic data for a plurality of training
speakers to generate speaker-adjusted acoustic data for each speaker of the plurality of training speakers.

3. The method according to claim 1, in which the eigenspace is generated by constructing supervectors (22) based on
speech of the training speaker population, and a dimensionality reduction (24) of these supervectors
is performed to define a reduced-dimensionality space that covers the training speaker population.

4. A method for performing speech recognition using the context-dependent models that have been developed
according to claim 1, comprising:

providing speech data from a new speaker (50);

using the eigenspace (62) to determine at least one new speaker centroid (64) and
subtracting the new speaker centroid from the speech data (66) of the new speaker to generate
speaker-adjusted data (68); and

supplying the speaker-adjusted data to a speech recognizer (60) in which the context-dependent models
(58) are used.

5. A method for performing speech recognition using the context-dependent models that have been developed
according to claim 1, comprising:

providing speech data from a new speaker (50);

using the eigenspace (62) to determine at least one new speaker centroid (64) of a
new speaker and adding the new speaker centroid to the context-dependent models (72)
to generate new speaker-adjusted context-dependent models (74); and

supplying the speech data to a speech recognizer (60) in which the new speaker-adjusted
context-dependent models (74) are used.



Claims (translated from the French version, Revendications)

1. A method for developing context-dependent models for automatic speech recognition,
comprising:

producing an eigenspace (18) to represent a population of training speakers;
providing a set of acoustic data (16) for at least one training speaker and
representing the acoustic data in the eigenspace (28) so as to determine at least one
allophone centroid for the training speaker; and

subtracting the centroid from the acoustic data (30) to produce speaker-adjusted acoustic data
for the training speaker;

using the speaker-adjusted acoustic data to grow at least one
decision tree (32) having leaf nodes containing context-dependent models for different
allophones.

2. The method according to claim 1, further comprising using a set of acoustic data for
a plurality of training speakers in order to produce speaker-adjusted acoustic data
for each of the plurality of training speakers.

3. The method according to claim 1, wherein the eigenspace is produced by constructing supervectors (22)
based on speech from the population of training speakers and by performing a
dimensionality reduction (24) on these supervectors to define a reduced-dimensionality space covering the
population of training speakers.

4. A speech recognition method using the context-dependent models developed in claim
1, comprising:

providing speech data by a new speaker (50);

using the eigenspace (62) to determine at least one new speaker centroid (64) and
subtracting this new speaker centroid from the speech data (66) from the new speaker to
produce speaker-adjusted data (68); and

applying the speaker-adjusted data to a speech recognition device (60)
using the context-dependent models (58).

5. A method for performing speech recognition using the context-dependent models developed
in claim 1, comprising:

providing speech data by a new speaker (50);

using the eigenspace (62) to determine at least one new speaker centroid (64) of a
new speaker and adding this new speaker centroid to the context-dependent models (72) to
produce context-dependent models (74) adjusted to the new speaker; and
applying the speech data to a speech recognition device (60) using the
context-dependent models (74) adjusted to the new speaker.






EP 1 103 952 B1 







[FIG. 4: block diagram. Box labels: Speaker Independent Recognizer; Calculate Position in K-Space; Delta Decision Trees; Determine Centroid of New Speaker; Context Dependent (CD) Models; Add Centroid to CD Models; Speaker-Adjusted CD Models; Context-dependent Recognizer; Recognizer Output. Reference numerals shown: 34, 52, 56, 58, 60, 62, 64, 70, 72, 74.]




[FIG. 5: decision tree with the question "Is +1 voiced consonant?" and three leaf models.
Model C1 (spkr1: <h *ae f>, spkr2: <m *ae th>, spkr3: <t *ae s k>)
Model C2 (spkr1: <s *ae d>, spkr2: <b *ae b l>, spkr3: <k *ae d>)
Model C3 (spkr1: <f *ae t>, spkr2: <g *ae p>, spkr3: <t *ae p>)]

[FIG. 6: diagram showing Model C1, Model C2 and Model C3.]







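[FIG. 8: flow diagram of test question scoring. A test question (104) drawn from the Pool of Questions yields a "Yes" model (106) and a "No" model (108) for each speaker; the probability scores of the two models are multiplied per speaker (112, 114, 116), and the per-speaker products are multiplied to give the overall score (118).]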








